Week 3 - Data Quality and tidyr

Emorie D Beck

Outline

  1. Documenting your design
  2. Building a codebook
  3. Cleaning your data using codebooks
  4. Problem set and Question time

Documenting your design

Documenting your design

  • Documentation is a critical part of open science, but it’s not something we’re usually taught
  • Documentation will look different for different types of research, but it’s still worthwhile to think about its common features
  • Common Documentation:
    • Preregistration
    • Experiment Script (for standardizing across experimenters)
    • Survey / experimental files / stimuli / questions
    • Codebooks of all variables collected
    • Codebooks of variables used in a given study

Documenting your design

  • Today I want to touch on three things:
    • Preregistration (brief, mostly focusing on pointing you to resources)
    • Protocol and design flow
    • Codebooks of variables used in a given study (and how to use them in R)

Preregistration

  • Preregistration:
    • Specifying your study design, research questions, hypotheses, data cleaning, analytic plan, inference criteria, and robustness checks in advance
  • Why should you preregister?
    • Badges are fun
    • Preregistrations are not rigid; they are a chance to think through the questions you want to ask and answer and the challenges that might arise in doing so
    • Builds trust in the scientific process

Preregistration

  • Preregistration is hard
    • Specifying your plan in advance takes considerable effort and time, which can feel like very slow science
  • Preregistration is worthwhile
    • But preregistering plans, code, etc. can speed up the analytic portion of your research workflow, which builds great momentum for writing and submitting projects

What should I preregister?

  • It depends on the project. Examples include a study design or an individual paper / project
    • Study design: A large survey is collected or a multi-part experiment is conducted. Measures, design, and some research questions and hypotheses are specified a priori
    • Individual paper / project: A single-part survey or experiment is conducted or a specific piece of a multi-part study is investigated. If it is part of a multi-part study/experiment, it should be linked to the parent preregistration

Learning More:

Protocol and Design Flow

  • Procedure sections in scientific papers are meant to map out, as concisely and simply as possible, how data were obtained (adhering to human subjects ethical codes, etc.)
  • But such sections are not sufficient to replicate or reproduce research because study designs are much more intricate and include many more details than what fits in a method section
    • e.g., measures not used because they weren’t focal, the code that underlies how data are collected, preprocessing, etc.

Protocol and Design Flow

  • As researchers, it’s our job to make sure that the work we do is documented so well that someone could replicate our studies.
  • Think of it sort of like doing your taxes. You want to keep enough information that if you were audited, you would be able to quickly and easily provide all the relevant information.

Protocol and Design Flow

  • What you need to document will depend on the kind of work you do.
  • As an example, in my ecological momentary assessment work, I do the following:
    • Preregister the design
    • Write a methods section that includes text for every measure included in any part of the study as well as an extended and detailed procedure description. This also includes information on how data will be cleaned and composited
    • Write a detailed codebook including all measures that were collected, regardless of whether I have research questions or hypotheses for them. This is shareable with anyone who wants to use the data
    • Make a technical workflow document. This describes how all documents, scripts, etc. work together to produce the final result, including what is automated, what requires researcher action, etc.
    • Comment all code and documents extensively
    • Keep a deviations document, where I record every deviation from my initial plans after the design is complete and data begin to be collected (or analyses start)

Protocol and Design Flow

  • Extensive documentation is also an investment in future you! My measures and procedures sections basically write themselves, and my analytic plan is already written in the preregistration
  • This means both that I’m faster and more efficient at writing these and that I feel more confident about the design choices I made, which is a win-win

Codebooks

Codebooks

  • For me, codebooks are the most important part of any research project
  • Codebooks allow me to:
    • parse through documentation and find all the variables I want
    • document detailed information about each of those variables
    • make cleaning and compositing choices for each (e.g., renaming, recoding, removing missings, etc.)
    • differentiate among the kinds of variables I have (e.g., predictors, outcomes, covariates, manipulations, and other categories)
    • pass all this information into R to aid in data cleaning
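
To make the list above concrete, here is a minimal base R sketch of a codebook driving the cleaning steps. Everything in it is hypothetical and for illustration only: the codebook columns (old_name, new_name, category, reverse, mini, maxi), the variable names, the toy data, and the assumption that -9 marks a missing value. A real codebook would usually be read in from a spreadsheet rather than typed out like this.

```r
# Hypothetical codebook: one row per variable used in the study
codebook <- data.frame(
  old_name = c("SID", "E1_w1", "E2_w1", "hlth_w1"),
  new_name = c("id", "extraversion_1", "extraversion_2", "health"),
  category = c("id", "predictor", "predictor", "outcome"),
  reverse  = c("no", "no", "yes", "no"),
  mini     = c(NA, 1, 1, 1),   # scale minimum (used for reverse-scoring)
  maxi     = c(NA, 5, 5, 5),   # scale maximum (used for reverse-scoring)
  stringsAsFactors = FALSE
)

# Hypothetical raw data; -9 is an assumed missing-data code
raw <- data.frame(
  SID     = c(1, 2, 3),
  E1_w1   = c(4, -9, 2),
  E2_w1   = c(1, 5, -9),
  hlth_w1 = c(3, 4, 5),
  junk    = c("a", "b", "c")   # not in the codebook, so it gets dropped
)

# 1. Keep only the variables listed in the codebook, then rename them
clean <- raw[, codebook$old_name]
names(clean) <- codebook$new_name

# 2. Recode the assumed missing code (-9) to NA in every non-id column
items <- codebook$new_name[codebook$category != "id"]
clean[items] <- lapply(clean[items], function(x) replace(x, x == -9, NA))

# 3. Reverse-score the variables the codebook flags, using min + max - x
rev_rows <- codebook[codebook$reverse == "yes", ]
for (i in seq_len(nrow(rev_rows))) {
  v <- rev_rows$new_name[i]
  clean[[v]] <- rev_rows$mini[i] + rev_rows$maxi[i] - clean[[v]]
}

# 4. Use the category column to pull out groups of variables
predictors <- codebook$new_name[codebook$category == "predictor"]
outcomes   <- codebook$new_name[codebook$category == "outcome"]
```

The point of the design is that every cleaning decision lives in the codebook rather than in the script: changing a name or flagging another item for reverse-scoring means editing one row, not rewriting code.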